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Abstract: We present an alternating augmented Lagrangian method for convex optimization 
problems where the cost function is the sum of two terms, one that is separable in the variable 
blocks, and a second that is separable in the difference between consecutive variable blocks. 
Examples of such problems include Fused Lasso estimation, total variation denoising, and multi- 
period portfolio optimization with transaction costs. In each iteration of our method, the first 
step involves separately optimizing over each variable block, which can be carried out in parallel. 
The second step is not separable in the variables, but can be carried out very efficiently. We apply 
the algorithm to segmentation of data based on changes in mean (£i mean filtering) or changes 
in variance {£i variance filtering). In a numerical example, wc show that our implementation is 
around 10000 times faster compared with the generic optimization solver SDPT3. 
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1. INTRODUCTION 

In this paper we consider optimization problems where 
the objective is a sum of two terms: The first term is 
separable in the variable blocks, and the second term is 
separable in the difference between consecutive variable 
blocks. One example is the Fused Lasso method in statis- 
tical learning, Tibshirani et al. [2005], where the objective 
includes an ^i-norm penalty on the parameters, as well as 
an ^i-norm penalty on the difference between consecutive 
parameters. The first penalty encourages a sparse solution, 
i.e., one with few nonzero entries, while the second penalty 
enhances block partitions in the parameter space. The 
same ideas have been applied in many other areas, such as 
Total Variation (TV) denoising, Rudin et al. [1992], and 
segmentation of ARX models, Ohlsson et al. [2010] (where 
it is called sum-of-norms regularization) . Another example 
is multi-period portfolio optimization, where the variable 
blocks give the portfolio in different time periods, the first 
term is the portfolio objective (such as risk-adjusted re- 
turn), and the second term accounts for transaction costs. 

In many applications, the optimization problem involves 
a large number of variables, and cannot be efficiently 
handled by generic optimization solvers. In this paper, 
our main contribution is to derive an efficient and scalable 
optimization algorithm, by exploiting the structure of the 
optimization problem. To do this, we use a distributed 
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optimization method called Alternating Direction Method 
of Multiphers (ADMM). ADMM was developed in the 
1970s, and is closely related to many other optimization 
algorithms including Brcgman iterative algorithms for £i 
problems, Douglas-Rachford splitting, and proximal point 
methods; see Eckstein and Bcrtsekas [1992], Combettes 
and Pesquet [2007]. ADMM has been applied in many 
areas, including image and signal processing, Setzer [2011], 
as well as large-scale problems in statistics and machine 
learning, Boyd et al. [2011]. 

We will apply ADMM to ii mean filtering and ii variance 
filtering (Wahlberg et al. [2011]), which are important 
problems in signal processing with many applications, for 
example in financial or biological data analysis. In some 
applications, mean and variance filtering are used to pre- 
process data before fitting a parametric model. For non- 
stationary data it is also important for segmenting the 
data into stationary subsets. The approach we present is 
inspired by the ii trend filtering method described in Kim 
et al. [2009], which tracks changes in the mean value of 
the data. (An example in this paper also tracks changes in 
the variance of the underlying stochastic process.) These 
problems are closely related to the covariance selection 
problem, Dempster [1972], which is a convex optimization 
problem when the inverse covariance is used as the opti- 
mization variable, Banerjee et al. [2008]. The same ideas 
can also be found in Kim et al. [2009] and Friedman ct al. 
[2008]. 



This paper is organized as follows. In Section 2 we review 
the ADMM method. In Section 3, we apply ADMM to our 
optimization problem to derive an efficient optimization 
algorithm. In Section 4.1 we apply our method to ii 
mean filtering, while in Section 4.2 we consider £i variance 
filtering. Section 5 contains some numerical examples, and 
Section 6 concludes the paper. 

2. ALTERNATING DIRECTION METHOD OF 
MULTIPLIERS (ADMM) 

In this section we give an overview of ADMM. We follow 
closely the development in Section 5 of Boyd et al. [2011]. 

Consider the following optimization problem 

minimize /(a;) , , 

subject to a; e C ^ ' 

with variable x G M", and where / and C are convex. We 
let p* denote the optimal value of (1). We first re- write the 
problem as 

minimize f{x)+Ic(z) 

subject to X = z, 
where Ic{z) is the indicator function on C (i.e., Ic{z) = 
for z G C, and Ic{z) ~ oo for z ^ C). The augmented 
Lagrangian for this problem is 

Lp{x,z,u) = fix) +Ic{z) + {p/2)\\x- z + u\\l, 
where u is a scaled dual variable associated with the 
constraint x = z, i.e., u — {l/p)y, where y is the dual 
variable for x = z. Here, p > is a penalty parameter. 

In each iteration of ADMM, we perform alternating min- 
imization of the augmented Lagrangian over x and z. At 
iteration k we carry out the following steps 

x'^+i := argmin{/(x) + {p/2)\\x ~ z"" + u''\\l} (3) 



(2) 



X 

z^+^ :=nc(x' 
,fe+i 



fe+i 



') (4) 

u--.^u^ + {x^+^ -z''+^), (5) 

where He denotes Euclidean projection onto C. In the 
first step of ADMM, we fix z and u and minimize the 
augmented Lagrangian over x; next, we fix x and u and 
minimize over z; finally, we update the dual variable u. 

2. 1 Convergence 

Under mild assumptions on / and C, we can show that the 
iterates of ADMM converge to a solution; specifically, we 
have 

f(x^)'^p\ x^^z^^O, 
as fc — >■ oo. The rate of convergence, and hence the number 
of iterations required to achieve a specified accuracy, can 
depend strongly on the choice of the parameter p. When 
p is well chosen, this method can converge to a fairly 
accurate solution (good enough for many applications), 
within a few tens of iterations. However, if the choice of 
p is poor, many iterations can be needed for convergence. 
These issues, including heuristics for choosing p, are dis- 
cussed in more detail in Boyd et al. [2011]. 

2.2 Stopping criterion 

The primal and dual residuals at iteration k are given by 



We terminate the algorithm when the primal and dual 
residuals satisfy a stopping criterion (which can vary 
depending on the requirements of the application). A 
typical criterion is to stop when 

II ^11 ^- pri II ^11 ^' dual 

Here, the tolerances e?''' > and e^uai ^ g ^^^ -^^ ^^-^ ^-^^ 
an absolute plus relative criterion, 

gdual^^gate^^relp||^fc||^^ 

where e^*^** > and e''°' > arc absolute and relative 
tolerances (see Boyd et al. [2011] for details). 

3. PROBLEM FORMULATION AND METHOD 

In this section we formulate our problem and derive an 
efficient distributed optimization algorithm via ADMM. 

3.1 Optimization problem 

We consider the problem 

N N~l 

minimize ^ $i(xi) + ^ ^^(ri) .g^ 

2—1 i—1 

subject to ri — x^+i ~ Xi, i = 1, . . . ,N — 1 

with variables xi, . . . , x^v, ri, . . . , r^^^i G R", and where 
$i : R" ^ R U {oo} and *i : R" ^ R U {oo} are convex 
functions. 

This problem has the form (1), with variables x = 
(xi, . . . ,xn)i t — [ti, . . ■ ,r7v_i), objective function 

N N-l 

/(a;,r)=^$,(x,)+5]*^(r,) 

i=l i=l 

and constraint set 

C = {{x, r) I Ti = Xi+i - Xj, i^l,...,N -I}. (7) 

The ADMM form for problem (6) is 



N 



N-1 



minimize ^ $i(xi) + ^ "^iiri) + Ic{2 



(8) 



(x'-z^), 



-p{z' 



,k-i\ 



subject to Ti = Si, i — 1, . . . ,N — 1 

Xi ^ Zi, z = 1, . . . , Af, 

with variables x = (xi, . . . ,xn), r = (ri, . . . , tat^i), 
z = (zi, . . . , zjv), and s = (si, . . . , sjv_i). Furthermore, 
we let u = (ui, . . . , upf) and t — (ti, . . . , ^at-i) be vectors 
of scaled dual variables associated with the constraints 
Xi — Zi, i — I, . . . , N, and r^ = Si, i = 1, . . . , N — 1 [i.e., 
Ui = {l/p)yi, where yi is the dual variable associated with 

Xi Zi J . 

3.2 Distributed optimization method 

Applying ADMM to problem (8), we carry out the follow- 
ing steps in each iteration. 

Step 1. Since the objective function / is separable in Xj 
and ri , the first step (3) of the ADMM algorithm consists 
of 2A^ — 1 separate minimizations 

xf+^ := argmin{$,(x,) + {p/2)\\xi - zf + w,*||2}, (9) 



1, ... ,7V, and 



^fc+i ._ 



argmin{*,(r,) + {p/2)\\r, - sf + t^\\l}, (10) 



i — 1, . . . ,N — 1. These updates can all be carried out in 
parallel. For many applications, we will see that we can 
often solve (9) and (10) analytically. 



Step 2. In the second step of ADMM, we project [x^^^ + 
yfe^ j,fe+i ^ ^fe-j onto the constraint set C, i.e., 

(^fc+i^ gfc+i) ._ nc((x'=+\ r'=^+i) + (u^ t'^)). 

For the particular constraint set (7), we will show in 
Section 3.3 that the projection can be performed extremely 
efficiently. 



Step 3. Finally, we update the dual variables: 



,fe+i 



< + K 



fe+i 



.,k+l\ 



1, 



.N 



and 



,.+1 ^^ ,k ^ (^.+1 _ ^.+1) 



1 = 1, 



,N -I. 



These updates can also be carried out independently in 
parallel, for each variable block. 

3.3 Projection 

In this section we work out an efficient formula for pro- 
jection onto the constraint set C (7). To perform the 
projection 

{z,s) =nc((w,w)), 
we solve the optimization problem 

minimize ||z — w||2 + ||s — u||2 
subject to s = I?z, 

with variables z — (zi,...,Z7v) and s = (si, . . . , s^v-i), 
and where D e ji(N-i)nxNn jg ^^^^ forward difference 
operator, i.e., 

-I I 
-I I 



D 



This problem is equivalent to 



-/ / 



\Dz-v\\ 



minimize 

with variable z = (zi,...,zn). Thus to perform the 
projection we first solve the optimality condition 



D^v, 



(11) 



{I + D'D)z = 
for z, then we let s ~ Dz. 

The matrix / + D^ D is block tridiagonal, with diagonal 
blocks equal to multiples of /, and sub/supcr-diagonal 
blocks equal to — /. Let LL^ be the Cholesky factorization 
oi I + D^D. It is easy to show that L is block banded with 
the form 

'h.i 
h.i h.,2 

^3,2 ^3.3 



L^ 



l-N.N- 



-1 I 



N,N A 



/, 



where ® denotes the Kronecker product. The coefficients 
li,j can be explicitly computed via the recursion 

^1,1 - \/2, 

h+iA = ^^/ha, h+i,i+i = y 3 — h+i,i' i = 1, . . . , N ^ 2, 



N.N-l 



= -l/l 



N-l.N-l, 



N.N 



= J2-1 



N,N-1- 



The coefficients only need to be computed once, before the 
projection operator is applied. 

The projection therefore consists of the following steps 

(1) Form b :~ w + D'^v: 

&i:=wi-ui, feAT := Wat + t;jv_i, 



hi := Wi + (wj_] 



,iV-l. 



(2) Solve Ly = h: 

j/i := {l/lis)bi, 

Vi ■= {^/k,i){bi - k,i~iyi^i), i = 2,...,N. 

(3) Solve L^z = y: 

ZN ■= {'^/lN,N)yN, 

Zi := {l/ks){yi - k+iAZi+i), i = N -l,...,l. 

(4) Set s = Dz: 

Si := Zi+i - Zi, i = I,.. . ,N -1. 

Thus, we see that we can perform the projection very 
efficiently, in 0{Nn) flops (floating-point operations). In 
fact, if we pre-compute the inverses l/li,i, i = 1, . . . ,N, the 
only operations that are required are multiplication, addi- 
tion, and subtraction. We do not need to perforin division, 
which can be expensive on some hardware platforms. 

4. EXAMPLES 

4-1 £i Mean filtering 

Consider a sequence of vector random variables 

r,^AA(y„S), 1 = 1,.. .,iV, 

where j/i € R" is the mean, and S e S^ is the covariance 
matrix. We assume that the covariance matrix is known, 
but the mean of the process is unknown. Given a sequence 
of observations j/i, . . . , y^, our goal is to estimate the mean 
under the assumption that it is piecewise constant, i.e., 
j/i+i = yi for many values of i. 

In the Fused Group Lasso method, we obtain our estimates 
by solving 

N N-1 

^T^iVi- XifT.-^'^iyt -Xi) + XY^ 



minimize 



n 2 



i=l 

subject to ri ~ x^+i 
with variables x\, 



i-1, 



,iV-l, 



' N- 



,rjv-i. Let x-Y, . . . ,x^, 
_Y denote an optimal point, our estimates of 



, xat, ri,. 



yi,...,yN are x-y,...,Xj^. 

This problem is clearly in the form (6), with 



$»(a;j 



1 



{Vt 



,f^-\y,^Xi), *,(r,) = A||r,||: 



ADMM steps. For this problem, steps (9) and (10) of 
ADMM can be further simplified. Step (9) involves mini- 
mizing an unconstrained quadratic function in the variable 
Xi, and can be written as 



^fc+i 



(E-i+p/)-i(E-iy,+p(zf-^if)). 



Step (10) is 



r- := argniin 



{\\\nh + {p/2)\\n^s1 + t1\\l}, 



which simplifies to 

rf+i^5,/,(s,*-t,*), (12) 

where 5^ is the vector soft thresholding operator, defined 



as 



S^{a)^{l-K/\\a\\2)+a, 5«(0) = 0. 

Here the notation (w)+ ~ max{0,f} denotes the positive 
part of the vector v. (For details see Boyd et al. [2011].) 



Variations. In some problems, we might expect that 
individual components of Xt will be piecewise constant, in 
which case we can instead use the standard Fused Lasso 
method. In the standard Fused Lasso method we solve 

7V-1 



minimize 



N ^ 
i=l i=l 

subject to ri 



Xi 



1, 



,N, 



Xi+l 

with variables xi, . . . , xn, ti, . . . , rjsi-i. The ADMM up- 
dates are the same, except that instead of doing vector 
soft thresholding for step (10), we perform scalar compo- 
nentwise soft thresholding, i.e., 

ir1^'h=Sx/,{{s'[~t1),), j = l,...,n. 

4.2 £1 Variance filtering 

Consider a sequence of vector random variables (of dimen- 
sion n) 

Y,^Af{0,j:,), i^l,...,N, 

where S^ G S" is the covariance matrix for Yi (which 
we assume is fixed but unknown). Given observations 
of j/i, . . . , 2/Ar, our goal is to estimate the sequence of 
covariance matrices Si,..., Eat, under the assumption 
that it is piecewise constant, i.e., it is often the case that 
Si+i — Si. In order to obtain a convex problem, we use 
the inverse covariances Xi — S^ as our variables. 

The Fused Group Lasso method for this problem involves 
solving 

N N-1 

minimize ^Tr{X,yiyf) - logdetXi + A ^ \\Ri\\F 

i=l i=l 



subject to Ri — Xi^i — Xi, i = 1, . . . , 

where our variables are i?i G S", i = 1, 

X, G S", i = 1,...,7V. Here, 



, iV - 1, and 



-Ri 



^V{RJR^) 



Rn-1 



is the Frobenius norm of i?^. Let X*, . . . , X'^, R\, 

denote an optimal point, our estimates of Si, ... , S^f are 

{xt)-\...,{x*^)-\ 



ADMM steps. It is easy to see that steps (9) and (10) 
simplify for this problem. Step (9) requires solving 

Xf+i := argmin{$,(X,) + {p/2)\\X, - Z\ + t/f H^}, 

where 

$,(X,) = Tr(X,y,yf ) - logdctX,. 
This update can be solved analytically, as follows. 

(1) Compute the eigenvalue decomposition of 

p {Z^ - t/f ) - y,yj = QAQ^ 

where A — diag(Ai, . . . , A„). 

(2) Now let 



Xj + J^Tap 

M,:= , J = l,...,n. 

(3) Finally, we set 

Af+^ = gdiag(Mi,...,Mn)g^. 

For details of this derivation, see Section 6.5 in Boyd et al. 
[2011]. 

Step (10) is 

r1+' := argmin{A||i?,||f + (p/2)||i?, - S^ + T.^WJ}, 
Ri 

which simplifies to 

^i — ^\/pWi - J^i ), 
where S^ is a matrix soft threshold operator, defined as 

S^iA) = {1 - k/\\A\\f)+A, 5«(0)=0. 



Variations. As with £1 mean filtering, we can replace the 
Frobenius norm penalty with a componentwise vector ii- 
norm penalty on Ri to get the problem 

N N-1 

minimize ^ Tv{Xiy^yJ) - log dot A^ + A ^ Il-Rjjl 1 

i=l i=l 

subject to Ri ~ Xi+i — Xi, i ^ I,. . . ,N ~ 1, 

with variables Ri, . . . , Rn-i e S", and Ai, . . . , Xn & S" , 
and where 

piii = Ei^j-fei- 

Again, the ADMM updates are the same, the only differ- 
ence is that in step (10) we replace matrix soft thresholding 
with a componentwise soft threshold, i.e.. 



{Ri )Lrn — Sx/p{{S,- - Tj )l^rn), 



for I = 1, 



,n. 



4.3 ii Mean and variance filtering 

Consider a sequence of vector random variables 

Y, -AA(y„S,), i^l,...,N, 

where yi G R" is the mean, and S^ G S^ is the covariance 
matrix for Yi. We assume that the mean and covariance 
matrix of the process is unknown. Given observations 
j/i , . . . , ypf , our goal is to estimate the mean and the 
sequence of covariance matrices Si,...,SAr, under the 
assumption that they are piecewise constant, i.e., it is 



often the case that yi+i = yi and S^+i = E^. To obtain a 
convex optimization problem, we use 

A^i = — — zjj , rrii = Zjj Xi, 

as our variables. In the Fused Group Lasso method, we 
obtain our estimates by solving 



N 

E 



minimize V -(1/2) logdet(-X,) - Tr{X,yiyf 



-mf y, - (1/4) TriXr^m^mf ) 

+Ai J2 \\nh + X2j2 ll^^ll^ 

subject to Ti = rrii+i — m^, i = 1, . . . , A^ — 1, 
Ri — Xij^i — Xi, i — 1, . . . ,N — 1, 

with variables ri, . . . , rAr_i G R", rrii, . . . , tti^v G R-", 
i?i, . . . , Rpf—i G S , and JC'i, . . . , Xj\j G S_|_. 



ADMM steps. This problem is also in the form (6), how- 
ever, as far as we are aware, there is no analytical formula 
for steps (9) and (10). To carry out these updates, we must 
solve semidefinite programs (SDPs), for which there are a 
number of efficient and reliable software packages (Toh 
et al. [1999], Sturm [1999]). 

5. NUMERICAL EXAMPLE 

In this section we solve an instance of ii mean filtering 
with n = 1, E = 1, and N = 400, using the standard Fused 
Lasso method. To improve convergence of the ADMM 
algorithm, we use over-relaxation with a = 1.8, see Boyd 
et al. [2011]. The parameter A is chosen as approximately 
10% of Aniax, where Amax is the largest value that results 
in a non-constant mean estimate. Here, Amax ~ 108 and 
so A = 10. We use an absolute plus relative error stopping 
criterion, with e^^^ = 10^^ and e"^"' = 10""^. Figure 1 shows 
convergence of the primal and dual residuals. The resulting 
estimates of the means are shown in Figure 2. 
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Fig. 2. Estimated means (solid line), true means (dashed 
line) and measurements (crosses). 

We solved the same £i mean filtering problem using CVX, 
a package for specifying and solving convex optimization 
problems (Grant and Boyd [2011]). CVX calls generic 
SDP solvers SeDuMi (Toh et al. [1999]) or SDPT3 (Sturm 
[1999]) to solve the problem. While these solvers are 
reliable for wide classes of optimization problems, and 
exploit sparsity in the problem formulation, they are 
not customized for particular problem families, such as 
ours. The computation time for CVX is approximately 
20 seconds. Our ADMM algorithm (implemented in C), 
took 2.2 milliseconds to produce the same estimates. 
Thus, our algorithm is approximately 10000 times faster 
compared with generic optimization packages. Indeed, our 
implementation does not exploit the fact that steps 1 and 
3 of ADMM can be implemented independently in parallel 
for each measurement. Parallelizing steps 1 and 3 of the 
computation can lead to further speedups. For example, 
simple multi-threading on a quad-core CPU would result 
in a further 4x speed-up. 



6. CONCLUSIONS 
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Fig. 1. Residual convergence: Primal residual Cp (solid 
line), and dual residual Cd (dashed line). 



In this paper we derived an efficient and scalable method 
for an optimization problem (6) that has a variety of ap- 
plications in control and estimation. Our custom method 
exploits the structure of the problem via a distributed 
optimization framework. In many applications, each step 
of the method is a simple update that typically involves 
solving a set of linear equations, matrix multiplication, 
or thresholding, for which there are exceedingly efficient 
libraries. In numerical examples we have shown that we 
can solve problems such as ii mean and variance filtering 
many orders of magnitude faster than generic optimization 
solvers such as SeDuMi or SDPT3. 

The only tuning parameter for our method is the reg- 
ularization parameter p. Finding an optimal p is not a 
straightforward problem, but Boyd et al. [2011] contains 
many heuristics that work well in practice. For the ii mean 
filtering example, we find that setting p « A works well, 
but we do not have a formal justification. 
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